Data wrangling
En ocasiones nos encontramos con ficheros de datos desarreglado, lo que puede llegar a complicar el análisis de los datos. Por esto es necesario transformar o mapear datos de una estructura inicial a un formato más conveniente para poder utilizar dichos datos. En R lo hacemos con ayuda de la librería tidyr
.
library(tidyr)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.5.2
base <- read.csv("../Bases de datos/paises.csv")
kable(head(base),"markdown")
Afghanistan |
1999 |
Cases |
745 |
Afghanistan |
1999 |
Population |
19987071 |
Afghanistan |
2000 |
Cases |
2666 |
Afghanistan |
2000 |
Population |
20595360 |
Brazil |
1999 |
Cases |
37737 |
Brazil |
1999 |
Population |
1720063562 |
Spread
Permite dividir las columnas clave y valor en nuevas columnas.Por lo general se reducen la cantidad de filas y aumenta la cantidad de columnas.
basetidy <- spread(base, key = key, value = value)
kable(basetidy, "markdown")
Afghanistan |
1999 |
745 |
19987071 |
Afghanistan |
2000 |
2666 |
20595360 |
Brazil |
1999 |
37737 |
1720063562 |
Brazil |
2000 |
80488 |
17450898 |
China |
1999 |
212258 |
1272915272 |
China |
2000 |
213766 |
1280428583 |
Gather
Permite colapsar múltiples columnas en pares de clave valor. Por lo general se reducen la cantidad de columnas y aumenta la cantidad de filas.
basemessy <- gather(basetidy, key = 'llave', value = 'Valor', 3:4)
kable(basemessy,"markdown")
Afghanistan |
1999 |
Cases |
745 |
Afghanistan |
2000 |
Cases |
2666 |
Brazil |
1999 |
Cases |
37737 |
Brazil |
2000 |
Cases |
80488 |
China |
1999 |
Cases |
212258 |
China |
2000 |
Cases |
213766 |
Afghanistan |
1999 |
Population |
19987071 |
Afghanistan |
2000 |
Population |
20595360 |
Brazil |
1999 |
Population |
1720063562 |
Brazil |
2000 |
Population |
17450898 |
China |
1999 |
Population |
1272915272 |
China |
2000 |
Population |
1280428583 |
Separate
Separar una columna en multiples columnas, indicanco un caracter como delimitador.
rate <- read.csv("../Bases de datos/rate.csv")
kable(head(rate),"markdown")
Afghanistan |
1999 |
745/19987071 |
Afghanistan |
2000 |
2666/20595360 |
Brazil |
1999 |
37737/1720063562 |
Brazil |
2000 |
80488/17450898 |
China |
1999 |
212258/1272915272 |
China |
2000 |
213766/1280428583 |
ratediv <- separate(rate, col = "rate", into = c("tasa","poblacion"),sep = "/" )
kable(ratediv,"markdown")
Afghanistan |
1999 |
745 |
19987071 |
Afghanistan |
2000 |
2666 |
20595360 |
Brazil |
1999 |
37737 |
1720063562 |
Brazil |
2000 |
80488 |
17450898 |
China |
1999 |
212258 |
1272915272 |
China |
2000 |
213766 |
1280428583 |
Unite
Une multiples columnas en una sola columna, indicanco un caracter como delimitador.
kable(unite(ratediv, rate, c("tasa","poblacion"), sep ="/"),"markdown")
Afghanistan |
1999 |
745/19987071 |
Afghanistan |
2000 |
2666/20595360 |
Brazil |
1999 |
37737/1720063562 |
Brazil |
2000 |
80488/17450898 |
China |
1999 |
212258/1272915272 |
China |
2000 |
213766/1280428583 |
Dplyr
Permite manipulara la infromación como en SQL. Este paquete presenta las siguientes funciones:
- select extraer variables existente.
- filter extraer observaciones especificadas.
- mutate permite crear nuevas variables.
- summarise agregar o modificar el objeto de estudio.
- arrange ordenar los datos.
- group_by agrupaciones de datos.
Select
library(dplyr)
protein <- read.csv("../Bases de datos/protein.csv")
kable(head(protein),"markdown")
Albania |
10.1 |
1.4 |
0.5 |
8.9 |
0.2 |
42.3 |
0.6 |
5.5 |
1.7 |
Austria |
8.9 |
14.0 |
4.3 |
19.9 |
2.1 |
28.0 |
3.6 |
1.3 |
4.3 |
Belgium |
13.5 |
9.3 |
4.1 |
17.5 |
4.5 |
26.6 |
5.7 |
2.1 |
4.0 |
Bulgaria |
7.8 |
6.0 |
1.6 |
8.3 |
1.2 |
56.7 |
1.1 |
3.7 |
4.2 |
Czechoslovakia |
9.7 |
11.4 |
2.8 |
12.5 |
2.0 |
34.3 |
5.0 |
1.1 |
4.0 |
Denmark |
10.6 |
10.8 |
3.7 |
25.0 |
9.9 |
21.9 |
4.8 |
0.7 |
2.4 |
kable(head(select(protein, RedMeat,Eggs),3), "markdown")
10.1 |
0.5 |
8.9 |
4.3 |
13.5 |
4.1 |
kable(head(select(protein, 2:5),3),"markdown")
10.1 |
1.4 |
0.5 |
8.9 |
8.9 |
14.0 |
4.3 |
19.9 |
13.5 |
9.3 |
4.1 |
17.5 |
kable(head(select(protein, Country, starts_with('C')),3), "markdown")# Variables que empiezan con C
Albania |
42.3 |
Austria |
28.0 |
Belgium |
26.6 |
- contains El nombre de la variable coincide con cierto patrón
- matches El nombre de la variable coincide con alguna expresión regular
- everything Selecciona todas las variable.
auto <- read.csv("../Bases de datos/auto-mpg.csv")
kable(head(auto),"markdown")
1 |
28 |
4 |
140 |
90 |
2264 |
15.5 |
71 |
chevrolet vega 2300 |
2 |
19 |
3 |
70 |
97 |
2330 |
13.5 |
72 |
mazda rx2 coupe |
3 |
36 |
4 |
107 |
75 |
2205 |
14.5 |
82 |
honda accord |
4 |
28 |
4 |
97 |
92 |
2288 |
17.0 |
72 |
datsun 510 (sw) |
5 |
21 |
6 |
199 |
90 |
2648 |
15.0 |
70 |
amc gremlin |
6 |
23 |
4 |
115 |
95 |
2694 |
15.0 |
75 |
audi 100ls |
Filter
a1 <- filter(auto, auto$cylinders == 4)
kable(head(a1),"markdown")
1 |
28.0 |
4 |
140 |
90 |
2264 |
15.5 |
71 |
chevrolet vega 2300 |
3 |
36.0 |
4 |
107 |
75 |
2205 |
14.5 |
82 |
honda accord |
4 |
28.0 |
4 |
97 |
92 |
2288 |
17.0 |
72 |
datsun 510 (sw) |
6 |
23.0 |
4 |
115 |
95 |
2694 |
15.0 |
75 |
audi 100ls |
8 |
32.9 |
4 |
119 |
100 |
2615 |
14.8 |
81 |
datsun 200sx |
14 |
27.9 |
4 |
156 |
105 |
2800 |
14.4 |
80 |
dodge colt |
a2 <- filter(auto, auto$cylinders == 4 & auto$horsepower < 90)
kable(head(a2),"markdown")
3 |
36.0 |
4 |
107 |
75 |
2205 |
14.5 |
82 |
honda accord |
16 |
23.8 |
4 |
151 |
85 |
2855 |
17.6 |
78 |
oldsmobile starfire sx |
17 |
29.0 |
4 |
90 |
70 |
1937 |
14.2 |
76 |
vw rabbit |
20 |
29.0 |
4 |
135 |
84 |
2525 |
16.0 |
82 |
dodge aries se |
22 |
26.6 |
4 |
151 |
84 |
2635 |
16.4 |
81 |
buick skylark |
25 |
26.4 |
4 |
140 |
88 |
2870 |
18.1 |
80 |
ford fairmont |
Se pueden utilizar además los siguientes operadores lógicos:
y & !x
y - x
x | y
y union x
x & y
y intersección x
xor(x, y)
(y unido x) - (y intersección x)
Mutate
attach(auto)
a3 <- auto%>%mutate(Relacion = mpg / cylinders) %>% filter(Relacion > 7) %>% arrange(desc(mpg))%>%select(-acceleration)
kable(head(a3),"markdown")
269 |
46.6 |
4 |
86 |
65 |
2110 |
80 |
mazda glc |
11.650 |
388 |
44.6 |
4 |
91 |
67 |
1850 |
80 |
honda civic 1500 gl |
11.150 |
335 |
44.3 |
4 |
90 |
48 |
2085 |
80 |
vw rabbit c (diesel) |
11.075 |
150 |
44.0 |
4 |
97 |
52 |
2130 |
82 |
vw pickup |
11.000 |
95 |
43.4 |
4 |
90 |
48 |
2335 |
80 |
vw dasher (diesel) |
10.850 |
263 |
43.1 |
4 |
90 |
48 |
1985 |
78 |
volkswagen rabbit custom diesel |
10.775 |
estu <- read.csv("../Bases de datos/estudiantes.csv")
kable(head(estu),"markdown")
Lucia |
7 |
6 |
5 |
6 |
4 |
Pedro |
7 |
5 |
9 |
4 |
6 |
Ines |
5 |
6 |
9 |
7 |
7 |
Luis |
6 |
6 |
5 |
8 |
7 |
Andres |
9 |
6 |
6 |
6 |
9 |
Ana |
6 |
9 |
7 |
7 |
4 |
Ejemplo: Estudiantes que reprobaron
kable( estu %>% gather("materia","nota",2:6) %>% mutate(Estado = ifelse(nota > 7,"aprobo","reprobo"))%>%filter(Estado == "reprobo"), "markdown")
Lucia |
Matematicas |
7 |
reprobo |
Pedro |
Matematicas |
7 |
reprobo |
Ines |
Matematicas |
5 |
reprobo |
Luis |
Matematicas |
6 |
reprobo |
Ana |
Matematicas |
6 |
reprobo |
Jose |
Matematicas |
6 |
reprobo |
Sonia |
Matematicas |
6 |
reprobo |
Maria |
Matematicas |
6 |
reprobo |
Lucia |
Ciencias |
6 |
reprobo |
Pedro |
Ciencias |
5 |
reprobo |
Ines |
Ciencias |
6 |
reprobo |
Luis |
Ciencias |
6 |
reprobo |
Andres |
Ciencias |
6 |
reprobo |
Carlos |
Ciencias |
3 |
reprobo |
Sonia |
Ciencias |
7 |
reprobo |
Maria |
Ciencias |
6 |
reprobo |
Lucia |
EspaÃ.ol |
5 |
reprobo |
Luis |
EspaÃ.ol |
5 |
reprobo |
Andres |
EspaÃ.ol |
6 |
reprobo |
Ana |
EspaÃ.ol |
7 |
reprobo |
Jose |
EspaÃ.ol |
6 |
reprobo |
Sonia |
EspaÃ.ol |
7 |
reprobo |
Maria |
EspaÃ.ol |
5 |
reprobo |
Lucia |
Historia |
6 |
reprobo |
Pedro |
Historia |
4 |
reprobo |
Ines |
Historia |
7 |
reprobo |
Andres |
Historia |
6 |
reprobo |
Ana |
Historia |
7 |
reprobo |
Jose |
Historia |
6 |
reprobo |
Maria |
Historia |
4 |
reprobo |
Lucia |
EdFisica |
4 |
reprobo |
Pedro |
EdFisica |
6 |
reprobo |
Ines |
EdFisica |
7 |
reprobo |
Luis |
EdFisica |
7 |
reprobo |
Ana |
EdFisica |
4 |
reprobo |
Carlos |
EdFisica |
6 |
reprobo |
Sonia |
EdFisica |
3 |
reprobo |
Maria |
EdFisica |
6 |
reprobo |
groupby y summarise
Con estas se permiten definir las funciones de agregación a aplicar:
- min
- max
- n
- mean
- median
- var
- sd
- first
- last
- nth : enesimo valor
- n_distinct
Ejemplo: Resumen de cada estudiante
Resumen <- estu %>% gather('Materia','Nota',2:6) %>% group_by(Estudiante) %>%
summarise(Promedio = mean(Nota),Minima = min(Nota),Maxima = max(Nota), Materia= n()) %>% arrange(desc(Promedio))
kable(head(Resumen),"markdown")
Andres |
7.2 |
6 |
9 |
5 |
Carlos |
7.2 |
3 |
9 |
5 |
Jose |
7.0 |
6 |
9 |
5 |
Ines |
6.8 |
5 |
9 |
5 |
Ana |
6.6 |
4 |
9 |
5 |
Luis |
6.4 |
5 |
8 |
5 |
Uniendo conjuntos de datos
Existen funciones para uniones entre base de datos las cuales son: bind_cols, bind_rows, left_join, inner_join, semi_join y anti_join.
bind_cols y bind_rows
# Unir dos bases de datos por columnas
bind_cols(colum1 , colum2)
#Unir conjuntos de datos por filas
bind_rows(row1, row2)
inner_join
inner_join(base1, base2, by = c("variable1", " variable2", ...))